The 2015 Canadian federal election is in its final stretch and at Datacratic, we thought it would be a great opportunity to collect and play with some data.
Each party has a different view on what is important to move Canada forward. This has the effect of centering their official message around certain themes. The question we asked is: can we find patterns in the words used in each party’s press releases? Do some deal more with certain themes than others?
The answer is Yes.
This image is a static scatter plot of press releases. Each dot represents a press release, colored by each party’s official color, and press releases are positioned such that similar ones are close by. The axes are unitless.
The labels (and pink dots) represent major themes that we hand-selected and then projected into the 2D space computed by our algorithms, to help interpret the space.
The rest of this notebook is the exact code that was used to generate the scatter plot with MLDB along with an interactive plot at the bottom. The algorithms used to produce this image include word2vec and t-SNE.
In [458]:
import json, pandas, requests
from datetime import datetime
import numpy
from pymldb import Connection
mldb = Connection()
We wrote a small scraper using the scrapy Python library. We scraped the press releases of the NDP, the Liberals, the Green Party and the Conservatives. The resulting data was saved in a CSV file with one press release per line.
As mentioned above, we added a few lines representing the major themes we wish to project in our space.
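The shape of that CSV can be sketched as follows. This is a standalone Python 3 illustration, not the scraper itself; the column names (party, date, title, full_text) are assumptions inferred from the queries later in this notebook, and the sample values are made up.

```python
import csv, io

# Hypothetical miniature of all_parties_clean.csv: one press release per
# line, plus hand-added rows (party = "category") for the projected themes.
rows = [
    {"party": "ndp", "date": "2015-08-15",
     "title": "Example title", "full_text": "Example body text"},
    {"party": "category", "date": "2015-08-01",
     "title": "economy", "full_text": "economy jobs taxes budget"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["party", "date", "title", "full_text"])
writer.writeheader()
writer.writerows(rows)

buf.seek(0)
parsed = list(csv.DictReader(buf))
print(parsed[1]["party"])  # the hand-added theme rows use the "category" party
```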
In [717]:
ds = mldb.v1.datasets("raw")
ds.put({
    "type": "text.csv.tabular",
    "params": {
        "dataFileUrl": "file:///mldb_data/all_parties_clean.csv",
        "ignoreBadLines": True,
        "rowNamePrefix": "pr"
    }
})
Out[717]:
In [718]:
mldb.query("select party, count(*) from raw where to_timestamp(date) > to_timestamp('2015-08-01') group by party")
Out[718]:
In [719]:
mldb.query("select * from raw limit 1")
Out[719]:
The next step is to take the full_text column and tokenize its contents. We do this because we will need to compare press releases using their content.
The tokenize function will split the column into individual words, remove words with a length of less than 3, and apply a stoplist.
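In plain Python, the behavior we are relying on looks roughly like this. This is a sketch for intuition only (the real tokenization happens inside MLDB's tokenize function), and the stoplist here is a tiny illustrative subset.

```python
# Illustrative stoplist -- the real one is much larger.
STOPWORDS = {"the", "and", "for", "that", "with"}

def baggify(text, min_length=3):
    """Turn a text into a bag of words: split, drop short words, drop stopwords."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    counts = {}
    for w in words:
        if len(w) < min_length or w in STOPWORDS:
            continue
        counts[w] = counts.get(w, 0) + 1
    return counts

bag = baggify("The plan for the economy, and the economy plan.")
print(bag)  # {'plan': 2, 'economy': 2}
```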
In [720]:
bag = mldb.v1.procedures("baggify")
bag.put({
    "type": "transform",
    "params": {
        "inputDataset": "raw",
        "outputDataset": {
            "id": "bag_of_words",
            "type": "beh.mutable"
        },
        "select": """tokenize(full_text, {' ?!;/[]*"' as splitchars, '' as quotechar}) as *""",
        "where": "full_text IS NOT NULL AND title IS NOT NULL AND to_timestamp(date) > to_timestamp('2015-08-01')"
    }
})
mldb.v1.datasets("bag_of_words").delete()
bag.runs.post({})
Out[720]:
The result is a clean list of words that will be easy to compare across press releases.
In [721]:
mldb.query("select * from bag_of_words limit 1")
Out[721]:
In [722]:
df = mldb.query("select sum({*}) as * from bag_of_words")
df2 = df.T
df2.columns = ["count"]
In [723]:
df2.sort(columns="count", ascending=False)[:25]
Out[723]:
The word2vec tool, described here and here, is used to embed words into a high dimensional space.
Word2vec gives us word embeddings, but since what interests us is press releases, and each one is represented by a bag of words, we actually need document embeddings. This means that for a given press release, we need to embed each of its words and then combine them all to get a representation of the press release in the word2vec space. To achieve this, we simply averaged each dimension of a press release's word embeddings to get the coordinates for that press release.
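The averaging step can be sketched with numpy. This is a toy illustration, not MLDB's implementation: the vectors here are 4-dimensional instead of word2vec's 300, and the values are made up.

```python
import numpy as np

# Toy word2vec table: each word maps to a (made-up) embedding vector.
word_vectors = {
    "economy": np.array([0.1, 0.4, -0.2, 0.0]),
    "jobs":    np.array([0.3, 0.0,  0.2, 0.1]),
    "budget":  np.array([0.2, 0.2,  0.0, -0.1]),
}

def embed_document(bag_of_words):
    """Embed a document as the element-wise mean of its known words' vectors."""
    vecs = [word_vectors[w] for w in bag_of_words if w in word_vectors]
    return np.mean(vecs, axis=0)

doc = embed_document(["economy", "jobs", "unknownword"])
print(doc)  # element-wise mean of the "economy" and "jobs" vectors
```

Words absent from the word2vec vocabulary are simply skipped, which matches the "for the words present" filtering done later in the notebook.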
In [724]:
mldb.v1.datasets("w2v").delete()
w2v = mldb.v1.procedures("w2vimport")
w2v.put({
    "type": 'import.word2vec',
    "params": {
        "dataFileUrl": 'file:///mldb_data/GoogleNews-vectors-negative300.bin',
        "output": {
            "type": 'embedding',
            "id": 'w2v'
        }
    }
})
w2v.runs.post({})
Out[724]:
In [725]:
mldb.query("select * from w2v limit 5")
Out[725]:
In [726]:
ds = mldb.v1.datasets("tranposed_bag_of_words")
ds.put({
    "type": "transposed",
    "params": {
        "dataset": {"id": "bag_of_words"}
    }
})
Out[726]:
In [727]:
proc = mldb.v1.datasets("bow_merged")
proc.put({
    "type": "merged",
    "params": {
        "datasets": [{"id": "tranposed_bag_of_words"}, {"id": "w2v"}]
    }
})
Out[727]:
In [776]:
rows = requests.get("http://localhost/v1/datasets/bag_of_words/rows").json()
print rows[:5]
The bow_merged dataset now contains both the columns representing the embedding for each word as well as the pr* columns, representing how many times the given word appeared in the given press release.
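Conceptually, this merge is an index-aligned join on words. A pandas sketch with made-up numbers (not MLDB's actual machinery; the tiny 2-dimensional "embedding" and the pr-1/pr-2 row names are invented for illustration):

```python
import pandas as pd

# Rows are words. Left: per-press-release counts (the pr* columns);
# right: word2vec coordinates. All values are made up.
counts = pd.DataFrame({"pr-1": [2, 0], "pr-2": [1, 3]},
                      index=["economy", "jobs"])
embedding = pd.DataFrame({"000": [0.1, 0.3], "001": [0.4, 0.0]},
                         index=["economy", "jobs"])

bow_merged = counts.join(embedding)

# Averaging the embedding columns over the words with pr-1 > 0 gives
# the document embedding for press release pr-1.
doc = bow_merged.loc[bow_merged["pr-1"] > 0, ["000", "001"]].mean()
print(doc["000"], doc["001"])
```

This is exactly the shape of the avg(...) query shown in the next cells, just expressed in pandas.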
In [778]:
mldb.query("select * from bow_merged limit 1")
Out[778]:
Doing the following query will compute the average of the embeddings for the release represented by rows[0]. The >0 condition translates to: "for the words present in the release".
In [777]:
mldb.query("select avg({* EXCLUDING(pr*)}) as * from bow_merged where "+rows[0]+">0")
Out[777]:
In [729]:
import urllib
import multiprocessing.dummy

def calc_embedding(pr):
    return (pr, mldb.query("select avg({* EXCLUDING(pr*)}) as * from bow_merged where "+pr+">0"))

p = multiprocessing.dummy.Pool(25)
new_res = p.map(calc_embedding, rows)
We can now save the results in a new dataset.
In [771]:
col_names = requests.get("http://localhost/v1/datasets/w2v/columns").json()
ds = mldb.v1.datasets("pr_embed")
ds.put({
    "type": "embedding",
})
for pr, line in new_res:
    if len(line) == 0: continue
    row = []
    for col_name, cell in zip(col_names, line):
        row.append([col_name, float(line[cell]), 0])
    try:
        ds.rows.post({
            "rowName": pr,
            "columns": row
        })
    except:
        print pr, len(line)
        break
ds.commit.post({})
Out[771]:
In [743]:
mldb.query("select * from pr_embed where rowName() = '"+rows[0]+"'")
Out[743]:
In [762]:
tsne = mldb.v1.procedures("pr_embed_tsne")
tsne.put({
    "type" : "tsne.train",
    "params" : {
        "dataset" : {"id" : "pr_embed"},
        "output" : {"id" : "pr_embed_tsne", "type" : "embedding"},
        "select" : "*",
        "where" : "true",
        "modelFileUrl": "file:///mldb_data/tsne.bin",
        "functionName": "tsne_embed",
        "perplexity": 5
    }
})
mldb.v1.datasets("pr_embed_tsne").delete()
tsne.runs.post({})
Out[762]:
In [763]:
mldb.v1.datasets("pr_embed_tsne_merged").put({
    "type" : "merged",
    "params" : {
        "datasets": [
            {"id": "raw"},
            {"id": "pr_embed_tsne"}
        ]
    }
})
Out[763]:
In [764]:
mldb.query("""
select party, title, x, y
from pr_embed_tsne_merged
where to_timestamp(date) > to_timestamp('2015-08-01')
limit 5""")
Out[764]:
In [765]:
df = mldb.query("""
select party, title, x, y
from pr_embed_tsne_merged
where to_timestamp(date) > to_timestamp('2015-08-01')
""")
In [766]:
import numpy as np
colormap = {
    "ndp": "#FF8000",
    "liberal": "#DF0101",
    "conservative": "#0000FF",
    "green": "#01DF01",
    "category": "#FE2EC8"
}
import bokeh.plotting as bp
from bokeh.models import HoverTool
In [767]:
#this line must be in its own cell
bp.output_notebook()
In [773]:
press_releases = np.array([str(x.encode('ascii','ignore').split("|")[0]) for x in list(df.title.values)])
x = bp.figure(plot_width=900, plot_height=700,
              title="Press Releases of Canadian Federal Parties During 2015 Elections",
              tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
              x_axis_type=None, y_axis_type=None, min_border=1)
x.scatter(
    x=df.x.values,
    y=df.y.values,
    color=[colormap[k] for k in df.party.values],
    radius=1,
    fill_alpha=0.5,
    source=bp.ColumnDataSource({"title": press_releases})
).select(dict(type=HoverTool)).tooltips = {"title": "@title"}
bp.show(x)
This plot is interactive. Feel free to explore it by hovering over the different press releases. The tooltip shows the title of each press release.